
feat(docker): production-optimized multi-stage Dockerfile#90

Closed
pratikbin wants to merge 254 commits into chopratejas:main from pratikbin:feat/production-dockerfile

Conversation

@pratikbin
Contributor

Summary

  • Multi-stage build: build deps (gcc/g++) in builder stage, runtime is clean python:3.11-slim with only curl
  • uv with lockfile instead of raw pip for deterministic, fast installs with build cache mounts
  • Dep layer cached independently from source code — source-only rebuilds drop from ~37s to ~4s
  • Non-root: runs as headroom:1000 instead of root
  • Proper ENTRYPOINT/CMD separation for clean docker-compose overrides
  • CI workflow for multi-arch (linux/amd64 + linux/arm64) image publishing to GHCR
  • Expanded .dockerignore to exclude JS artifacts, IDE files, Docker meta-files
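The bullets above translate into a Dockerfile shaped roughly like this (a minimal sketch, not the PR's actual file: stage layout, cache-mount paths, and the `headroom` entrypoint are illustrative):

```dockerfile
# --- builder: compile deps with gcc/g++, resolve from uv.lock ---
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends gcc g++ \
    && rm -rf /var/lib/apt/lists/*
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
WORKDIR /app
# Dependency layer: cached until the lockfile changes
COPY pyproject.toml uv.lock ./
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-install-project
# Source layer: only this rebuilds on a code change
COPY . .
RUN --mount=type=cache,target=/root/.cache/uv uv sync --frozen

# --- runtime: slim image, no build toolchain ---
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/* \
    && useradd --uid 1000 headroom
COPY --from=builder /app /app
ENV PATH="/app/.venv/bin:$PATH"
USER headroom
WORKDIR /app
ENTRYPOINT ["headroom"]
CMD ["proxy"]
```

Because the lockfile layer sits above the `COPY . .` layer, a source-only change invalidates nothing but the final layers, which is where the 37s to 4s rebuild win comes from.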

Benchmarks (cold build, Apple Silicon / OrbStack)

Metric                    Old       New      Improvement
Image size                1.11 GB   514 MB   -54%
Cold build                55.6s     32.2s    -42%
Rebuild (source change)   36.8s     3.9s     -89%

Closes #89

Test plan

  • Local build succeeds (docker build -t headroom:local .)
  • ARM64 build succeeds (docker buildx build --platform linux/arm64)
  • Container starts, /health returns 200
  • CLI --help works inside container
  • CI workflow triggers on release and publishes to GHCR

chopratejas and others added 30 commits January 30, 2026 16:27
Use asyncio.run() instead of asyncio.get_event_loop().run_until_complete()
which raises RuntimeError in Python 3.10+ when no event loop exists.
Move the network timeout skip handler to the main tests/conftest.py
so it applies to all tests, not just tests/test_memory/*.

Fixes flaky CI failures when HuggingFace model downloads timeout.
- Add mkdocs.yml with Material theme (indigo, professional)
- Add docs/index.md landing page with quick install
- Add GitHub Actions workflow for auto-deployment
- Remove old docs/README.md (replaced by index.md)
- Add web dashboard at /dashboard endpoint with real-time stats
- Simplify dashboard metrics to user-friendly terms (removed confusing
  CCR/TOIN terminology)
- Track Headroom overhead separately from total latency
- Add request logging to Bedrock paths (was missing)
- Use package version (__version__) instead of hardcoded "1.0.0"
- Add latency min/max tracking in addition to average

Dashboard shows: requests, tokens saved, cost saved, overhead,
providers breakdown, performance stats, and recent requests table.
- Add dashboard URL (http://localhost:8787/dashboard) to quickstart
- Recommend headroom-ai[all] for best compression performance
- Note that first startup downloads ML models (~500MB one-time)
## Description

Add Headroom integration with AWS Strands Agents SDK, enabling automatic
context optimization and tool output compression for Strands-based agents.

Fixes chopratejas#14

## Type of Change

- [x] New feature (non-breaking change that adds functionality)
- [x] Documentation update

## Changes Made

### Core Integration (`headroom/integrations/strands/`)

- **HeadroomHookProvider** - Implements Strands `HookProvider` interface for
  automatic tool output compression via `AfterToolCallEvent`. Compresses
  verbose tool outputs before they enter conversation context.

- **HeadroomStrandsModel** - Model wrapper that extends Strands `Model` base
  class for message-level optimization. Implements all required abstract
  methods: `stream()`, `get_config()`, `update_config()`, `structured_output()`.

- **Provider auto-detection** - Automatically detects appropriate Headroom
  provider (Anthropic, OpenAI, Google) based on wrapped Strands model type.

- **`strands-agents` as optional dependency** - Install with
  `pip install headroom-ai[strands]`

### Testing (`tests/integrations/test_strands/`)

- **Real integration tests (25 tests)** - Use actual AWS Bedrock API calls
  with Claude 3 Haiku. Skip automatically when credentials unavailable.

- **Unit tests (57 tests)** - Mock-based tests for internal logic, edge cases,
  and error handling. No credentials required.

### Demo (`examples/strands_bedrock_demo.py`)

- Interactive demo showcasing both integration patterns
- Visual before/after compression comparison with token savings
- 4 verbose tools (search, logs, database, metrics) demonstrating real savings
- Supports `--hook` and `--model` flags for individual demos

## Testing

All tests verified:

- [x] Unit tests pass (57 tests)
- [x] Integration tests pass (25 tests with real Bedrock API)
- [x] Linting passes (`ruff check .`)
- [x] Type checking passes (`mypy headroom/integrations/strands/`)
- [x] Formatting passes (`ruff format --check`)
- [x] Demo runs successfully with ~50% token savings

## Test Output

```
$ pytest tests/integrations/test_strands/ -v
=================== 82 passed in 90.09s ===================

$ ruff check headroom/integrations/strands/ --ignore E402
All checks passed!

$ mypy headroom/integrations/strands/ --ignore-missing-imports
Success: no issues found
```

## Demo Results

```
╭────────────────────────────────────────────────────────────╮
│              HeadroomHookProvider Results                  │
│────────────────────────────────────────────────────────────│
│ Tokens BEFORE compression: 51,961                          │
│ Tokens AFTER compression:  25,658                          │
│ Tokens SAVED:              26,303 (50.6%)                  │
╰────────────────────────────────────────────────────────────╯
```
DiffCompressor:
- Parse unified diff format and compress by reducing context lines
- Preserve file headers and all +/- change lines
- Score hunks by relevance (error keywords, query matches)
- Add summary line: [N files, +X -Y lines]
- Expected 30-50% savings on typical git diffs
- Wire into content router for CompressionStrategy.DIFF
- 30 tests covering parsing, compression, edge cases

hnswlib SIGILL fix:
- Move hnswlib import from module level to lazy loading
- hnswlib crashes with SIGILL (Illegal Instruction) on CPUs
  without AVX support, before Python can catch the error
- Now imports only when HNSWVectorIndex is actually used
- HNSW_AVAILABLE is checked lazily via __getattr__

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
HTMLExtractor uses trafilatura to extract main content from HTML pages,
removing scripts, styles, navigation, and ads. This achieves 94.9%
compression while preserving 98.2% recall on the Scrapinghub benchmark.

Key features:
- Automatic HTML detection in content router
- Configurable output format (markdown or text)
- Metadata extraction (title, author, date, description)
- Batch extraction support

Evaluation framework:
- OSS benchmark integration (Scrapinghub Article Extraction Benchmark)
- LLM-as-judge evaluation for QA accuracy preservation
- F1 score: 0.919 on 181-sample benchmark (baseline: 0.958)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rename client/response variables to be unique per provider branch
to avoid type inference conflicts. Use getattr for Anthropic content
block text access to handle union types.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add native support for OpenRouter API via LiteLLM backend
- Introduce PROVIDER_REGISTRY pattern to eliminate scattered if/else blocks
- New providers can now be added with a single registry entry

Features:
- `headroom proxy --backend openrouter` routes requests to OpenRouter
- Pass-through model naming (anthropic/claude-3.5-sonnet, openai/gpt-4o, etc.)
- CLI shows provider-specific setup instructions from registry

Usage:
  export OPENROUTER_API_KEY="sk-or-v1-..."
  headroom proxy --backend openrouter

Also fixes mypy type errors in mcp_server.py
…ic tools

Add ability to exclude specific tools from compression, useful for CLI tools
like Claude Code where file/search output should be passed through unmodified.

Changes:
- Add DEFAULT_EXCLUDE_TOOLS constant with Read, Grep, Glob, Bash, WebFetch, WebSearch
- Add exclude_tools field to SmartCrusherConfig and ContentRouterConfig
- Add _build_tool_name_map() to ContentRouter for tool_call_id -> name mapping
- Skip compression for tool_result blocks from excluded tools
- Support both Anthropic (tool_use/tool_result) and OpenAI (tool_calls/tool) formats

This prevents Headroom from compressing output from tools where the user
expects to see the full, unmodified content (e.g., file reads, search results).
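A minimal sketch of the id-to-name mapping and exclusion check (helper names mirror but simplify `_build_tool_name_map`; message shapes follow the OpenAI and Anthropic tool formats):

```python
DEFAULT_EXCLUDE_TOOLS = {"Read", "Grep", "Glob", "Bash"}


def build_tool_name_map(messages):
    """Map each tool_call_id to the tool name that produced it."""
    name_map = {}
    for msg in messages:
        # OpenAI format: assistant message carrying tool_calls
        for call in msg.get("tool_calls") or []:
            name_map[call["id"]] = call["function"]["name"]
        # Anthropic format: content blocks of type tool_use
        content = msg.get("content")
        if isinstance(content, list):
            for block in content:
                if block.get("type") == "tool_use":
                    name_map[block["id"]] = block["name"]
    return name_map


def should_compress(tool_call_id, name_map, exclude=DEFAULT_EXCLUDE_TOOLS):
    """Skip compression for results produced by excluded tools."""
    return name_map.get(tool_call_id) not in exclude
```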

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add pytest.importorskip("trafilatura") to HTML extractor test modules
to skip tests gracefully when the optional trafilatura dependency is
not installed. This fixes CI failures in the base test matrix that
doesn't include the html extras.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous fix attempted to lazily import hnswlib but calling
_check_hnswlib_available() still triggered the import, which crashed
with SIGILL on CPUs without AVX support before Python could catch it.

Fix by using subprocess to safely probe for hnswlib availability:
- Import AND create an Index in a subprocess to catch SIGILL at both
  import time and first use of AVX instructions
- If subprocess succeeds, then import in main process
- Add debug logging for all failure modes (timeout, crash, etc.)
- Isolates any crash to the subprocess, keeping test process alive
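The subprocess probe can be sketched as follows (function name and probe parameters are illustrative, not the PR's exact code):

```python
import subprocess
import sys


def probe_hnswlib(timeout=10.0):
    """Probe hnswlib in a throwaway subprocess. A SIGILL on a non-AVX
    CPU kills only the child; the parent sees a nonzero exit code
    instead of crashing. Creating an Index exercises the AVX paths
    that a bare import might not touch."""
    code = "import hnswlib; hnswlib.Index(space='l2', dim=16)"
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a hung probe as unavailable
    return result.returncode == 0
```

Only if the probe succeeds does the main process perform the real import.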

AI review: code-reviewer (1 iteration)
Adversarial review: code-critic (addressed logging, more robust probe)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add pytestmark skip conditions to memory test modules that depend on
hnswlib (core_operations, factory, easy). The subprocess probe for
hnswlib correctly detects unavailability on some platforms (like
Python 3.13 CI runners), but these tests were still trying to run
and failing with ImportError.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
## What this PR fixes

1. **CI Python 3.12 failure**: Added skip decorator to `TestLocalBackend`
   in `test_memory_system.py` - these tests require hnswlib which is not
   available on all CI runners.

2. **Missing test coverage**: Added 6 tests for the `exclude_tools` feature
   in `test_content_router.py`. Tests use existing helper functions
   `generate_python_code()`, `generate_json_data()`, and
   `generate_search_results()` defined at lines 57-95 of the same file.

3. **Anthropic/OpenAI inconsistency**: Fixed `_process_content_blocks()`
   to add `router:excluded:tool` marker for Anthropic format, matching
   the OpenAI format behavior at line 1157.

4. **Dead code removal**: Removed unused `exclude_tools` field from
   `SmartCrusherConfig` - the actual implementation uses
   `ContentRouterConfig.exclude_tools` in content_router.py.

AI review: code-reviewer (2 iterations), adversarial-reviewer (2 iterations)
Issues fixed: missing test coverage, format inconsistency, dead code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
WebFetch and WebSearch should NOT be excluded by default because:
1. Web content is Headroom's sweet spot - lots of noise (nav, ads, boilerplate)
2. CCR allows retrieval if LLM needs original content
3. Excluding them undermines the core value proposition

DEFAULT_EXCLUDE_TOOLS now only contains local file/code tools:
- Read, Glob, Grep, Bash (and lowercase variants)

These local tools return precise content (line numbers, paths, code)
where exact fidelity matters immediately. Web tools benefit from
compression and can use CCR for on-demand retrieval.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…e-tools-compression

feat(compression): add exclude_tools to bypass compression for specific tools
- Add `question` parameter to LLMLinguaCompressor.compress() for QA-aware
  token selection (passes to LLMLingua-2's compress_prompt)
- Flow `question` parameter through ContentRouter compression pipeline
- Enable ContentRouter in default pipeline (was missing, causing 0% compression)
- Add `content_router_enabled` config option to HeadroomConfig

This improves compression accuracy for QA tasks by allowing LLMLingua-2 to
preserve tokens relevant to answering the given question.
Root Cause:
The `find_tool_units()` function in `parser.py` only detected OpenAI
format tool calls (assistant.tool_calls + role="tool" messages), not
Anthropic format (assistant.content[type=tool_use] + user.content[type=tool_result]).

This caused RollingWindow and IntelligentContext transforms to treat
Anthropic tool_use and tool_result as separate, independently droppable
messages. When context needed to be trimmed, the assistant message with
tool_use could be dropped while keeping the user message with tool_result,
creating orphaned tool_result blocks.

When sent to the Anthropic API, this produces the error:
"unexpected tool_use_id found in tool_result blocks"

Changes:
1. parser.py: Extended `find_tool_units()` to detect Anthropic format:
   - Scan user messages for content blocks with type="tool_result"
   - Scan assistant messages for content blocks with type="tool_use"
   - Map tool_use_id to corresponding response message indices

2. rolling_window.py: Extended `_get_protected_indices()` to protect
   Anthropic format tool pairs:
   - Detect tool_use blocks in assistant.content
   - Find and protect matching user messages with tool_result blocks

3. tests/test_parser.py: Added 4 new tests for Anthropic format:
   - test_anthropic_format_tool_use_and_result
   - test_anthropic_format_multiple_tool_uses
   - test_anthropic_format_orphaned_tool_result
   - test_mixed_openai_and_anthropic_formats

Test Results: 82 passed (including 4 new Anthropic format tests)
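The pairing logic can be sketched like this (a simplified stand-in for `find_tool_units`, covering only the Anthropic format):

```python
def find_anthropic_tool_units(messages):
    """Pair assistant tool_use blocks with the user tool_result
    messages that answer them, keyed by tool_use id. Orphaned
    tool_result blocks (no matching tool_use) are ignored."""
    use_index = {}  # tool_use id -> index of the assistant message
    units = {}      # tool_use id -> (assistant_idx, result_idx)
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if msg.get("role") == "assistant" and block.get("type") == "tool_use":
                use_index[block["id"]] = i
            elif msg.get("role") == "user" and block.get("type") == "tool_result":
                uid = block.get("tool_use_id")
                if uid in use_index:
                    units[uid] = (use_index[uid], i)
    return units
```

A context manager can then protect or drop each `(assistant_idx, result_idx)` pair atomically, which is exactly what prevents the orphaned-tool_result API error.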

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ropic-tool-unit-detection

fix: Handle Anthropic format tool_use/tool_result as atomic units
…ager

Follows up on PR chopratejas#19 which fixed RollingWindow but missed IntelligentContextManager,
the default context manager used by the proxy.

Changes:
1. intelligent_context.py: Extended `_get_protected_indices()` to handle Anthropic format:
   - Scan assistant.content for type="tool_use" blocks
   - Protect user messages containing type="tool_result" blocks with matching tool_use_id

2. test_intelligent_context.py: Added TestAnthropicFormatToolProtection class with 5 tests:
   - test_anthropic_tool_result_protected_when_tool_use_protected
   - test_anthropic_tool_units_dropped_atomically
   - test_anthropic_multiple_tools_same_message_atomic
   - test_anthropic_format_no_api_error_scenario (verifies bug fix)
   - test_mixed_openai_and_anthropic_formats

3. test_rolling_window.py: Added matching TestAnthropicFormatToolProtection class with 5 tests
   - Added skip decorator for CI/CD when OPENAI_API_KEY is not set

This ensures both context managers (RollingWindow and IntelligentContextManager) correctly
handle Anthropic's native tool_use/tool_result format, preventing the
"unexpected tool_use_id found in tool_result blocks" API error.
…ntext-anthropic-format

fix: Extend Anthropic format tool protection to IntelligentContextManager
- Add MemoryToolAdapter for unified memory across providers
- Anthropic: Uses native memory tool (memory_20250818) for subscription safety
- OpenAI/Gemini/Others: Uses function calling format
- All providers share the same semantic vector store backend
- Simplify CLI to single --memory flag with auto-detection
- Add proper resource cleanup (close methods) to fix test isolation
- Update README with memory documentation
Implements comprehensive memory tracking for all in-memory components:

- Add MemoryTracker singleton with ComponentStats, ProcessStats, MemoryReport
- Add get_memory_stats() to CompressionStore, BatchContextStore,
  GraphStore, HNSWVectorIndex
- Add /debug/memory API endpoint for runtime monitoring

Components tracked:
- compression_store: CCR compressed tool outputs
- batch_context_store: Batch API request contexts
- graph_store: Knowledge graph entities and relationships
- vector_index: HNSW vector embeddings
- semantic_cache: Response cache
- request_logger: Request metadata

Includes 47 tests (unit + integration) with real API calls.
…ervability

Add memory observability system (Phase 1)
Replace unbounded InMemoryGraphStore with SQLite-backed implementation:
- Persistent storage survives proxy restarts
- Memory bounded by configurable SQLite page cache (default 8MB)
- Same async interface as InMemoryGraphStore (drop-in replacement)
- LocalBackend now uses SQLiteGraphStore by default (graph_persist=True)

New files:
- headroom/memory/adapters/sqlite_graph.py: SQLite graph store implementation
- tests/test_sqlite_graph_store.py: 37 comprehensive tests

Key features:
- O(log n) lookups via database indexes
- Case-insensitive entity name lookup per user
- BFS subgraph traversal and shortest path finding
- CASCADE delete for entity relationships
- MemoryTracker integration via get_memory_stats()
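A minimal `sqlite3` sketch of the properties listed above (schema and names are illustrative, not the PR's actual `sqlite_graph.py`):

```python
import sqlite3


def open_graph(path=":memory:"):
    """Open a bounded, persistent entity/relationship store with
    indexed lookups, case-insensitive names per user, and
    cascade-deleted relationships."""
    db = sqlite3.connect(path)
    db.execute("PRAGMA foreign_keys = ON")  # required for CASCADE
    db.execute("""CREATE TABLE IF NOT EXISTS entities (
        id INTEGER PRIMARY KEY,
        user_id TEXT NOT NULL,
        name TEXT NOT NULL COLLATE NOCASE,
        UNIQUE (user_id, name))""")
    db.execute("""CREATE TABLE IF NOT EXISTS relations (
        src INTEGER NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
        dst INTEGER NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
        kind TEXT NOT NULL)""")
    # Index backs the O(log n) per-user name lookups
    db.execute("""CREATE INDEX IF NOT EXISTS idx_entities_user_name
                  ON entities(user_id, name)""")
    return db
```

`COLLATE NOCASE` on the name column gives the case-insensitive lookup; `ON DELETE CASCADE` removes an entity's relationships with it.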
…ervability

Add SQLiteGraphStore for bounded, persistent graph storage
chopratejas and others added 25 commits March 30, 2026 16:25
…d tool

Bedrock requires role=tool messages immediately after assistant tool_calls.
The previous fix inserted a user text message in between when the message
contained both text and tool_result blocks, breaking the pairing.

Drop text alongside tool_result (Claude Code never sends it in practice).
Added ordering regression tests for the Bedrock constraint.
The /v1/responses handler was passing through without compression,
meaning Codex CLI users got zero savings. Now converts Responses API
items (function_call, function_call_output, reasoning, message) to
Chat Completions format, runs the existing pipeline, and converts back.

- New: headroom/proxy/responses_converter.py — pure conversion functions
- 21 unit tests + 3 integration tests (tested with real OpenAI API)
- Preserves reasoning items, images, unknown types verbatim
- Skips compression when previous_response_id is set
- 27% compression on real Codex-pattern payloads (500 records → 14K tokens saved)

Closes chopratejas#73
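The item mapping can be sketched as follows (a hedged approximation of `responses_converter.py`; item shapes follow the Responses API, but the helper name and passthrough representation are invented):

```python
def responses_items_to_chat(items):
    """Convert Responses-API items to Chat Completions messages.
    Reasoning, image, and unknown items are returned separately so
    they can be reattached verbatim after compression."""
    messages, passthrough = [], []
    for item in items:
        t = item.get("type")
        if t == "function_call":
            messages.append({"role": "assistant", "tool_calls": [{
                "id": item["call_id"],
                "type": "function",
                "function": {"name": item["name"],
                             "arguments": item.get("arguments", "{}")},
            }]})
        elif t == "function_call_output":
            messages.append({"role": "tool",
                             "tool_call_id": item["call_id"],
                             "content": item["output"]})
        elif t == "message":
            messages.append({"role": item.get("role", "user"),
                             "content": item.get("content", "")})
        else:
            passthrough.append(item)  # reasoning, images, unknown types
    return messages, passthrough
```

After the existing pipeline compresses the chat-format messages, the inverse conversion rebuilds Responses items and reinserts the passthrough entries.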
Private scripts with credentials should not be tracked in git.
Codex v0.117.0+ with newer models uses WebSocket instead of HTTP POST
for the Responses API. Added @app.websocket("/v1/responses") handler that:
- Accepts ws:// connections and forwards to wss://api.openai.com
- Compresses input on first message using existing pipeline
- Relays all response events bidirectionally
- Handles SSL (certifi), graceful disconnect, missing websockets lib

Tested with real OpenAI API: basic text, large tool outputs (200 records),
parallel function calls, instructions preservation.

Addresses chopratejas#79
KompressCompressor now tries ONNX Runtime first (156MB INT8 model),
falls back to PyTorch only if ONNX unavailable. No torch needed for
text compression — just onnxruntime (~50MB) + transformers (tokenizer).

Changes:
- Add onnxruntime + transformers to [proxy] extra in pyproject.toml
- Add _OnnxModel wrapper with get_scores/get_keep_mask interface
- _load_kompress() tries ONNX first, falls back to PyTorch
- is_kompress_available() returns True if EITHER backend available
- compress() handles both numpy (ONNX) and tensor (PyTorch) outputs

Dependency impact:
  Before: pip install headroom-ai[proxy] → no text compression
  After:  pip install headroom-ai[proxy] → Kompress ONNX INT8 (156MB)
  [ml] extra still available for full PyTorch (600MB, GPU support)
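The ONNX-first loading order reduces to a simple import fallback (names here are illustrative, not the actual `_load_kompress()` internals):

```python
def pick_backend():
    """Try ONNX Runtime first, fall back to PyTorch, and report None
    when neither is installed. Mirrors the loading order above."""
    try:
        import onnxruntime  # noqa: F401
        return "onnx"
    except ImportError:
        pass
    try:
        import torch  # noqa: F401
        return "torch"
    except ImportError:
        return None


def is_available():
    # Available if EITHER backend can be imported
    return pick_backend() is not None
```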
- WebSocket proxy for /v1/responses (Codex gpt-5.4+ support)
- Kompress ONNX INT8 text compression (no torch needed, ~100MB vs 1.5GB)
- Updated Discord invite link
- Tool_result ordering fix for Bedrock
OpenAI's WebSocket Responses API requires the header
'OpenAI-Beta: responses-api=v1'. Without it, the server returns HTTP 500
on the WebSocket upgrade. Also forward all client headers (not just auth)
to upstream, skipping only hop-by-hop headers.

Tested with real OpenAI API: basic text, large tool output compression,
all working through WebSocket proxy.

Fixes chopratejas#82, updates chopratejas#79
…history

feat: persist proxy savings history
savings_usd is now tokens_saved * model list input price (monotonic,
transparent). Removed non-monotonic moving-average repricing and
confusing cost_without_headroom counterfactual.

Dashboard hero shows "Compression Savings" with clear subtitle.
Savings Breakdown section shows compression, cache, and RTK separately
with distinct colors and no scope mixing.

All beacon/telemetry fields preserved. RTK token counts still reported.

Fixes chopratejas#83
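The new pricing rule is deliberately simple. A sketch, assuming list prices are quoted per million input tokens (a convention chosen here, not stated in the PR):

```python
def savings_usd(tokens_saved, input_price_per_mtok):
    """Monotonic savings: tokens saved times the model's list input
    price. No moving averages, so the number never decreases as
    more tokens are saved."""
    return tokens_saved / 1_000_000 * input_price_per_mtok
```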
…0.5.17

- Fix beacon spam: file lock ensures only one beacon per proxy regardless
  of worker count. Workers > 1 caused N beacons firing N rows per cycle.
- Beacon upsert: on_conflict=session_id prevents duplicate rows.
- Beacon stop() guard: skip final report if uptime < 2 minutes.
- Fix dashboard cost: savings_usd now uses model list price (monotonic),
  not moving average. Separate breakdown for compression/cache/rtk.

Fixes chopratejas#83
feat: Add Anthropic url overrides via env vars and cli flags
transforms_summary is a counted dict (e.g. {"router:tool_result:text": 4})
alongside the raw transforms_applied list. Cleaner display for users
without losing the raw data for debugging.
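The counted dict is essentially a `collections.Counter` over the raw list:

```python
from collections import Counter


def summarize_transforms(transforms_applied):
    """Collapse the raw transform list into a counted dict while the
    caller keeps the original list for debugging."""
    return dict(Counter(transforms_applied))
```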
Move ProxyConfig, RequestLog, CacheEntry, RateLimitState to
headroom/proxy/models.py. Re-exported from server.py for backward
compatibility — all existing imports continue to work.

server.py: 8835 → 8643 lines (-192)
models.py: 199 lines (new)

Part of the server.py split effort to improve maintainability.
…nd forwarding

- Fix WS /v1/responses: forward Sec-WebSocket-Protocol (subprotocol) to
  upstream instead of stripping it — root cause of Codex HTTP 500 errors
- Fix WS relay: handle binary messages properly instead of crashing on
  .decode(), add debug logging instead of silent except:pass
- Add Authorization header fallback from OPENAI_API_KEY env var for WS
- Extract response body from websockets InvalidStatus for error debugging
- Fix streaming /v1/responses: pass optimized_tokens (not original_tokens
  twice) so compression savings appear in streaming metrics
- Fix hardcoded provider="bedrock" in 4 metrics/log locations — now uses
  self.anthropic_backend.name so LiteLLM backends report correctly
- Forward --backend, --anyllm-provider, --region flags from wrap commands
  (codex, aider) to the proxy subprocess via _start_proxy()
- Forward API key from request headers to LiteLLM acompletion() calls
- Forward region to Vertex AI (vertex_location) not just Bedrock
- Redesign proxy startup banner: show routing table instead of misleading
  "Backend: Anthropic" label

Closes chopratejas#86

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CacheAligner was hardcoded to enabled=True in the pipeline despite
the config default being False. It extracts dynamic content from the
system prompt middle and reinserts at the end, which:
1. CHANGES the prefix bytes → provider cache miss (loses 90% discount)
2. ADDS ~341 tokens of formatting overhead per request
3. Net effect: more expensive, worse caching

Now uses the config default (enabled=False). The CacheAligner still
exists for users who explicitly opt in, but the proxy no longer
forces it on.
Manual ssl.create_default_context() + certifi doesn't load the Windows
system certificate store, causing HTTP 500 on wss:// connections to
OpenAI. Using ssl=True lets the websockets library handle SSL natively
with proper cross-platform cert store loading.
Multi-stage build with uv, non-root user, and layer-optimized caching.

- Multi-stage: build deps (gcc/g++) stay in builder, runtime is clean slim
- uv instead of pip: uses existing uv.lock for deterministic fast installs
- Layer caching: deps cached independently from source (rebuild 37s -> 4s)
- Non-root: runs as headroom:1000 instead of root
- Image size: 1.11GB -> 514MB (-54%)
- Add CI workflow for multi-arch (amd64+arm64) GHCR publishing
- Expand .dockerignore to exclude JS artifacts, IDE files, Docker files

Closes chopratejas#89
@chopratejas
Owner

Thanks for the changes.

Regarding the UI: we already have a bare-bones dashboard in Headroom.

Do you think we should augment that?

@pratikbin pratikbin force-pushed the feat/production-dockerfile branch from 52550e4 to 653303f on April 2, 2026 19:24
@pratikbin
Contributor Author

pratikbin commented Apr 2, 2026

Yes, I like the little dashboard, so I built one. If you can augment that, it would be wonderful.

mine: [dashboard screenshot]

ahh i hate git-filter-repo

@pratikbin pratikbin closed this Apr 2, 2026


Development

Successfully merging this pull request may close these issues.

[FEATURE] official container image

10 participants